# 1. Readme

This dataset accompanies the publication *Analyzing Requirements and Traceability Information to Improve Bug Localization*.
It consists of:

* 15 SQLite databases representing the input data
* Additional result data only briefly described in the paper
* Training and test data used to train the ABLoTS classifiers

## 2. Project Databases

* Each project is stored in a separate SQLite database.
* All databases share the same structure, explained in the next sub-sections.
* Each database contains 5 tables with the following meanings.

| Table               | Description                                                                                 |
|---------------------|---------------------------------------------------------------------------------------------|
| issue               | Contains the artifacts, i.e. bug reports and requirements extracted from Jira               |
| issue_link          | Contains explicitly defined dependency links between artifacts, extracted from Jira         |
| change_set          | Contains essential commit information extracted from Git                                    |
| code_change         | Provides information about the files modified in each commit, extracted from Git            |
| change_set_link     | Represents implementation links from commits to artifacts                                   |
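
A quick sanity check after downloading a database is to confirm that the five tables listed above are present. The following is a minimal sketch using Python's standard `sqlite3` module; the helper name `missing_tables` and the file name in the usage comment are hypothetical and not part of the dataset.

```python
import sqlite3

# Table names as listed in the overview table above.
EXPECTED_TABLES = {"issue", "issue_link", "change_set", "code_change", "change_set_link"}


def missing_tables(conn):
    """Return the expected table names that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return EXPECTED_TABLES - present


# Usage (the file name is a placeholder; substitute one of the 15 databases):
#   conn = sqlite3.connect("some_project.sqlite3")
#   print(missing_tables(conn))   # an empty set means all tables were found
```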

### 2.1 Table 'issue'

| Field                   | Description                                                                             |
|-------------------------|-----------------------------------------------------------------------------------------|
| issue_id                | (Primary Key) Unique identifier of the artifact                                         |
| issue_type              | Artifact type                                                                           |
| created_date            | Creation time stamp (UTC) of artifact using ISO8601 format                              |
| fixed_date              | Resolution time stamp (UTC) of artifact using ISO8601 format                            |
| summary_stemmed         | Preprocessed summary of the artifact. Preprocessing is explained in the publication     |
| description_stemmed     | Preprocessed description of the artifact. Preprocessing is explained in the publication |

### 2.2 Table 'issue_link'

| Field                   | Description                                 |
|-------------------------|---------------------------------------------|
| source_issue_id         | (Primary Key) Identifier of source artifact |
| target_issue_id         | (Primary Key) Identifier of target artifact |
| name                    | Name of the link                            |

### 2.3 Table 'change_set'

| Field                   | Description                                   |
|-------------------------|-----------------------------------------------|
| commit_hash             | (Primary Key) Unique identifier of commit     |
| committed_date          | Commit time stamp (UTC) using ISO8601 format  |

### 2.4 Table 'code_change'

| Field                   | Description                                         |
|-------------------------|-----------------------------------------------------|
| commit_hash             | (Primary Key) Link to parent commit                 |
| file_path               | (Primary Key) Name of the affected source file      |
| sum_added_lines         | Number of added LOC in this commit for this file    |
| sum_removed_lines       | Number of removed LOC in this commit for this file  |

### 2.5 Table 'change_set_link'

| Field                   | Description                     |
|-------------------------|---------------------------------|
| issue_id                | (Primary Key) Link to artifact  |
| commit_hash             | (Primary Key) Link to commit    |
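
The tables above can be combined to answer typical bug localization questions, for example "which files were changed to resolve a given artifact?". The sketch below joins `change_set_link` with `code_change` using the field names documented above. The function name `files_changed_for_issue` is hypothetical, and the column types (for instance whether `issue_id` is stored as text or as an integer) are not specified here, so the parameter may need adjusting.

```python
import sqlite3


def files_changed_for_issue(conn, issue_id):
    """Return (file_path, added LOC, removed LOC) summed over all commits
    linked to the given artifact, ordered by file path."""
    query = """
        SELECT cc.file_path,
               SUM(cc.sum_added_lines),
               SUM(cc.sum_removed_lines)
        FROM change_set_link AS csl
        JOIN code_change AS cc ON cc.commit_hash = csl.commit_hash
        WHERE csl.issue_id = ?
        GROUP BY cc.file_path
        ORDER BY cc.file_path
    """
    # The "?" placeholder lets sqlite3 bind the value safely.
    return conn.execute(query, (issue_id,)).fetchall()
```

Joining additionally with `change_set` on `commit_hash` would give access to `committed_date`, for example to restrict the result to commits made before a bug report was created.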

## 3. Detailed results for CollabScore, SimiScore, and TraceScore

* For space reasons, the paper only shows averaged accuracy measures for the three approaches.
* Detailed results in terms of *Top-1*, *Top-5*, *Top-10*, *MAP*, and *MRR* are in the respective `*.csv` files.
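
For readers unfamiliar with these measures, the sketch below shows how they are commonly computed for a single bug report, given a ranked list of candidate files and the set of files actually fixed. It follows the usual information retrieval definitions and is an assumption about the convention, not a reproduction of the paper's evaluation code; in particular, average precision is divided here by the number of fixed files. *MAP* and *MRR* are the means of the per-bug values over all bug reports. All function names are illustrative.

```python
def top_k_hit(ranked_files, fixed_files, k):
    """True if at least one fixed file appears among the first k ranked files."""
    return any(path in fixed_files for path in ranked_files[:k])


def reciprocal_rank(ranked_files, fixed_files):
    """1 / rank of the first fixed file, or 0.0 if no fixed file is ranked."""
    for rank, path in enumerate(ranked_files, start=1):
        if path in fixed_files:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_files, fixed_files):
    """Mean of the precision values at each rank where a fixed file occurs.

    Assumed convention: the sum is divided by the number of fixed files.
    """
    if not fixed_files:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, path in enumerate(ranked_files, start=1):
        if path in fixed_files:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(fixed_files)
```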

## 4. Training/Test data for **ABLoTS** classifier

* The training data (first 80% of all bugs per project) and test data (remaining 20% of bugs per project) are stored in `ablots.zip`.
* The data is stored in `ARFF` format, the format commonly used by the [WEKA](http://www.cs.waikato.ac.nz/ml/weka) machine learning library.
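
The 80/20 split described above can be illustrated with a small sketch. This README does not state how bugs are ordered or how the boundary is rounded, so the code assumes that "first" means ordered by `created_date` and that the cut index is rounded down; both are assumptions, and the function name is hypothetical. It relies on ISO 8601 UTC time stamps of identical format sorting chronologically when compared as strings.

```python
def chronological_split(bugs, train_fraction=0.8):
    """Split (issue_id, created_date) pairs into training and test lists.

    Assumes created_date values are ISO 8601 UTC strings in one uniform
    format, so that string order equals chronological order.
    """
    ordered = sorted(bugs, key=lambda bug: bug[1])
    cut = int(len(ordered) * train_fraction)  # assumed rounding: floor
    return ordered[:cut], ordered[cut:]
```

The split would be applied to each project separately, since the percentages are given per project.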
